Abstract

We study supervised learning for both binary and multiclass classification from a unified geometric perspective. In particular, we propose a geometric regularization technique to find the submanifold corresponding to a robust estimator of the class probability $P(y|x)$. The regularization term measures the volume of this submanifold, based on the intuition that overfitting produces rapid local oscillations and hence a large volume of the estimator. This technique can be applied to regularize any classification function that satisfies two requirements: first, an estimator of the class probability can be obtained; second, the first and second derivatives of the class probability estimator can be calculated. In experiments, we apply our regularization technique to standard loss functions for classification; our RBF-based implementation compares favorably to widely used regularization methods for both binary and multiclass classification.

1. Introduction 

In supervised learning for classification, regularization seeks a balance between a perfect description of the training data and the potential for generalization to unseen data. Most regularization techniques are defined by penalizing some functional norm. For instance, one of the most successful classification methods, the support vector machine (SVM) (Vapnik, 1998; Schölkopf & Smola, 2002), and its variants (Bartlett et al., 2006; Steinwart, 2005) use an RKHS norm as a regularizer. While functional norm based regularization is widely used in machine learning, we feel that this approach overlooks important local geometric information.

In many real-world classification problems, if the feature space is meaningful, then all samples within a small enough neighborhood of a training sample should have class probability $P(y|x)$ similar to that of the training sample. For instance, a small enough perturbation of the RGB values at some pixels of a human face image should not dramatically change the likelihood of correct identification of this image during face recognition. However, such “small local oscillations” of the class probability are not explicitly addressed by penalizing commonly used functional norms. For instance, as reported by Goodfellow et al. (2014), linear models and their combinations can be easily fooled by hardly perceptible perturbations of a correctly predicted image, even when an L2 regularizer is adopted.

Geometric regularization techniques have also been studied in machine learning. Belkin et al. (2006) employed geometric regularization in the form of the L2 norm of the gradient magnitude supported on a manifold. This approach exploits the geometry of the marginal distribution $P(x)$ for semi-supervised learning, rather than the geometry of the class probability $P(y|x)$. Other related geometric regularization methods are motivated by the success of level set methods in image segmentation (Cai & Sowmya, 2007; Varshney & Willsky, 2010) and Euler’s Elastica in image processing (Lin et al., 2012; 2015). In particular, the Level Learning Set (Cai & Sowmya, 2007) combines a counting function of training samples and a geometric penalty on the surface area of the decision boundary. The Geometric Level Set (Varshney & Willsky, 2010) generalizes this idea to standard empirical risk minimization schemes with margin-based loss. Along this line, the Euler’s Elastica Model (Lin et al., 2012; 2015) proposes a regularization technique that penalizes both the gradient oscillations and the curvature of the decision boundary. However, all three methods focus on the geometry of the decision boundary supported in the domain of the feature space, and the “small local oscillation” of the class probability is not explicitly addressed.

In this work, we argue that the “small local oscillation” of the class probability actually lies in the product space of the feature domain and the probabilistic output space, and can be characterized by the geometry of a submanifold in this product space corresponding to the class probability. Let $f : \mathcal{X} \to \Delta^{L-1}$ be a class probability estimator, where $\mathcal{X}$ is the feature space and $\Delta^{L-1}$ is the probabilistic simplex for $L$ classes. From a geometric perspective, if we regard $\{(x, f(x)) \mid x \in \mathcal{X}\}$, the functional graph (in the geometric sense) of $f$, as a submanifold in $\mathcal{X} \times \Delta^{L-1}$, then “small local oscillations” can be measured by the local flatness of this submanifold.

In our approach, the learning process can be viewed as a submanifold fitting problem that is solved by a geometric flow method. In particular, our approach finds a submanifold by iteratively fitting the training samples in a curvature or volume decreasing manner, without any a priori assumptions on the geometry of the submanifold in $\mathcal{X} \times \Delta^{L-1}$. We use gradient flow methods to find an optimal direction, i.e. at each step we find the vector field pointing in the optimal direction to move $f$. As we will see in the next section, this regularization approach naturally handles binary and multiclass classification in a unified way, while previous decision boundary based techniques (and most functional regularization approaches) are originally designed for binary classification, and rely on “one versus one”, “one versus all” or, more efficiently, a binary coding strategy (Varshney & Willsky, 2010) to generalize to the multiclass case.

In experiments, a radial basis function (RBF) based implementation of our formulation compares favorably to widely used binary and multiclass classification methods on datasets from the UCI repository and real-world datasets including the Flickr Material Database (FMD) and the MNIST Database of handwritten digits.

In summary, our contributions are: 

• A geometric perspective on overfitting and a regularization approach that exploits the geometry of a robust class probability estimator for classification,

• A unified gradient flow based algorithm for both binary and multiclass classification that can be applied to standard loss functions, and

• An RBF-based implementation that achieves promising experimental results.



Figure 1. Example of three-class learning, i.e., $L = 3$, where the input space $\mathcal{X}$ is 2d. Training samples of the three classes are marked with red, green and blue dots respectively. The class label for each training sample corresponds to a vertex of the simplex $\Delta^{L-1}$. As a result, each mapped training point $(x_i, z_i)$ lies on one face (corresponding to its label $y_i$) of the space $\mathcal{X} \times \Delta^2$.


2. Method Overview 

In our work, we propose a regularization scheme that exploits the geometry of a robust class probability estimator and suggest a gradient flow based approach to solve for it. In the following, we describe our approach. Related mathematical notation is summarized in Table 2.

Following the probabilistic setting of classification, given a sample (feature) space $\mathcal{X} \subset \mathbb{R}^N$, a label space $\mathcal{Y} = \{1, \ldots, L\}$, and a finite training set of labeled samples $T_m = \{(x_i, y_i)\}_{i=1}^m$, where each training sample is generated i.i.d. from a distribution $P$ over $\mathcal{X} \times \mathcal{Y}$, our goal is to find a classifier $h_{T_m} : \mathcal{X} \to \mathcal{Y}$ such that for any new sample $x \in \mathcal{X}$, $h_{T_m}$ predicts its label $y = h_{T_m}(x)$. The optimal generalization risk (Bayes risk) is achieved by the classifier $h^*(x) = \operatorname{argmax}\{\eta^\ell(x), \ell \in \mathcal{Y}\}$, where $\eta = (\eta^1, \ldots, \eta^L)$ with $\eta^\ell : \mathcal{X} \to [0,1]$ being the $\ell$th class probability, i.e. $\eta^\ell(x) = P(y = \ell | x)$.

Our regularization approach exploits the geometry of the class probability estimator, and can be regarded as a “hybrid” plug-in/ERM scheme (Audibert & Tsybakov, 2007). A regularized loss minimization problem is set up to find an estimator $f : \mathcal{X} \to \Delta^{L-1}$, where $\Delta^{L-1}$ is the standard $(L-1)$-simplex in $\mathbb{R}^L$, and $f = (f^1, \ldots, f^L)$ is an estimator of $\eta$ with $f^\ell : \mathcal{X} \to [0,1]$. The estimator $f$ is then “plugged in” to get the classifier $h_f(x) = \operatorname{argmax}\{f^\ell(x), \ell \in \mathcal{Y}\}$.

Figure 1 shows an example of the setup of our approach, for a synthetic three-class classification problem. The submanifold corresponding to the estimator $f$ is the graph (in the geometric sense) of $f$: $\mathrm{gr}(f) = \{(x, f^1(x), \ldots, f^L(x)) : x \in \mathcal{X}\} \subset \mathcal{X} \times \Delta^{L-1}$. We denote a point in the space $\mathcal{X} \times \Delta^{L-1}$ as $(x, z) = (x^1, \ldots, x^N, z^1, \ldots, z^L)$, where $x \in \mathcal{X}$ and $z \in \Delta^{L-1}$. Then, in this product space,

Figure 2. Example of binary learning via gradient flow. As shown in (a), the feature space $\mathcal{X}$ is 2d, training points are sampled uniformly within the region $[-15, 15] \times [-15, 15]$, and labeled by the function $y = \mathrm{sign}(10 - \|x\|_2)$ (the red circle). In the initialization step, shown in (b), positive and negative training points map to the two faces of the space $\mathcal{X} \times \Delta^1$ respectively. Our gradient flow method starts from a neutral function $f_0 \equiv \frac{1}{2}$ and moves towards the negative direction (red and blue arrows) of the penalty gradient $\nabla\mathcal{P}_{f_0}$. Panel (c) shows the submanifold $\mathrm{gr}(f_1)$ one step after (b). The submanifold then continues to evolve towards $-\nabla\mathcal{P}_f$ step by step, and the final output after convergence of the algorithm is shown in (d).


a training pair $(x_i, y_i = \ell)$ naturally maps to the point $(x_i, z_i) = (x_i, 0, \ldots, 1, \ldots, 0)$, with the one-hot vector $z_i$ (with the 1 in its $\ell$-th slot) at the vertex of $\Delta^{L-1}$ corresponding to $P(y = y_i | x) = 1$.
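The mapping of a labeled sample into the product space can be sketched in a few lines. This is only an illustration: the function name `map_to_product_space` and the example values are hypothetical, and labels are assumed to be 1-based as in the label space {1, ..., L}.

```python
def map_to_product_space(x, label, num_classes):
    """Map a training pair (x, y = label) to the point (x, z) in X x simplex.

    `label` is assumed to be 1-based, matching the label space {1, ..., L}.
    """
    if not 1 <= label <= num_classes:
        raise ValueError("label must lie in {1, ..., L}")
    z = [0.0] * num_classes
    z[label - 1] = 1.0  # the 1 sits in the label-th slot
    return list(x) + z


# A 2-d sample of class 2 in a three-class problem (N = 2, L = 3).
point = map_to_product_space([0.3, -1.2], label=2, num_classes=3)
```

The result has N + L coordinates: the first N are the features and the last L are the vertex of the simplex for the sample's class.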

We point out two properties of this geometric setup. First, it inherently handles multiclass classification, with binary classification as a special case. Second, while the dimension of the ambient space, i.e. $\mathbb{R}^{N+L}$, depends on both the feature dimension $N$ and the number of classes $L$, the intrinsic dimension of the submanifold $\mathrm{gr}(f)$ depends only on $N$.

2.1. Variational formulation 

We want $\mathrm{gr}(f)$ to approach the mapped training points while remaining as flat as possible, so we impose a penalty on $f$ consisting of an empirical loss term $\mathcal{P}_{T_m}$ and a geometric regularization term $\mathcal{P}_G$. For $\mathcal{P}_{T_m}$, we can choose either the widely used cross-entropy loss function for multiclass classification or the simpler Euclidean distance function between the simplex coordinates of the graph point and the mapped training point. For $\mathcal{P}_G$, we would ideally consider an $L^2$ measure of the Riemann curvature of $\mathrm{gr}(f)$, as the vanishing of this term gives optimal (i.e., locally distortion free) diffeomorphisms from $\mathrm{gr}(f)$ to $\mathbb{R}^N$. However, the Riemann curvature tensor takes the form of a combination of derivatives up to third order, and the corresponding gradient vector field is even more complicated and inefficient to compute in practice. As a result, we measure the graph’s volume, $\mathcal{P}_G(f) = \int_{\mathrm{gr}(f)} d\mathrm{vol}$, where $d\mathrm{vol}$ is the induced volume from the Lebesgue measure on the ambient space $\mathbb{R}^{N+L}$.

More precisely, we find the function that minimizes the following penalty $\mathcal{P}$:

$$\mathcal{P} = \mathcal{P}_{T_m} + \lambda \mathcal{P}_G : \mathcal{M} = \mathrm{Maps}(\mathcal{X}, \Delta^{L-1}) \to \mathbb{R} \tag{1}$$

on the set $\mathcal{M}$ of smooth functions from $\mathcal{X}$ to $\Delta^{L-1}$, where $\lambda$ is the tradeoff parameter between empirical loss and regularization. It is important to note that any relative scaling of the domain $\mathcal{X}$ will not affect the estimate of the class probability $\eta$, as scaling will distort $\mathrm{gr}(f)$ but will not change the critical function estimating $\eta$.
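The intuition that oscillation inflates the volume of the graph can be checked numerically in the simplest setting, a one-dimensional feature space (N = 1), where the volume reduces to arc length. The sketch below is illustrative only: the helper `graph_length`, the midpoint rule, and both test functions are assumptions made for the example, not part of the method described here.

```python
import math


def graph_length(f_prime, a, b, steps=10000):
    """Approximate the volume (here: arc length) of gr(f) over [a, b] for N = 1.

    f_prime(x) returns the list of derivatives (df^1/dx, ..., df^L/dx), so the
    induced metric is g = 1 + sum_l (df^l/dx)^2 and dvol = sqrt(g) dx.
    Midpoint rule; `steps` is an arbitrary illustrative resolution.
    """
    h = (b - a) / steps
    total = 0.0
    for k in range(steps):
        x = a + (k + 0.5) * h
        g = 1.0 + sum(d * d for d in f_prime(x))
        total += math.sqrt(g) * h
    return total


# Binary case (L = 2): f = (p, 1 - p), so the two derivatives are (p', -p').
def flat(x):
    return [0.0, 0.0]                       # p(x) = 1/2, constant


def oscillating(x):
    dp = 0.4 * 20.0 * math.cos(20.0 * x)    # p(x) = 1/2 + 0.4 sin(20 x)
    return [dp, -dp]


vol_flat = graph_length(flat, 0.0, 1.0)
vol_osc = graph_length(oscillating, 0.0, 1.0)
```

The constant estimator has graph volume equal to the length of the interval, while the rapidly oscillating one has a volume several times larger, which is the quantity the geometric term penalizes.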

2.2. Gradient flow and geometric foundation 

The standard technique for solving variational formulas is the Euler-Lagrange PDE. However, due to our geometric term $\mathcal{P}_G$, finding the minimal solutions of the Euler-Lagrange equations for $\mathcal{P}$ is difficult. Instead, we solve for $\operatorname{argmin} \mathcal{P}$ using gradient flow in the functional space $\mathcal{M}$.

A simple but intuitive simulated example of binary learning using gradient flow for our approach is given in Figure 2. For explanation purposes only, we replace $\mathcal{M}$ with a finite dimensional Riemannian manifold $M$. Without loss of generality, we also assume that $\mathcal{P}$ is smooth; then it has a differential $d\mathcal{P}_f : T_fM \to \mathbb{R}$ for each $f \in M$, where $T_fM$ is the tangent space to $M$ at $f$. Since $d\mathcal{P}_f$ is a linear functional on $T_fM$, there is a unique tangent vector, denoted $\nabla\mathcal{P}_f$, such that $d\mathcal{P}_f(v) = \langle v, \nabla\mathcal{P}_f \rangle$ for all $v \in T_fM$. $\nabla\mathcal{P}_f$ points in the direction of maximal increase of $\mathcal{P}$ at $f$. Thus, the solution of the negative gradient flow $df_t/dt = -\nabla\mathcal{P}_{f_t}$ is a flow line of steepest descent starting at an initial $f_0$. For a dense open set of initial points, flow lines approach a local minimum of $\mathcal{P}$ as $t \to \infty$. We always choose the initial function $f_0$ to be the “neutral” choice $f_0(x) = (\frac{1}{L}, \ldots, \frac{1}{L})$, which reasonably assigns equal conditional probability to all classes.
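In the finite dimensional picture used above for exposition, the negative gradient flow can be discretized by explicit Euler steps. The following is a minimal sketch under that assumption; the quadratic toy penalty, the step size and the iteration count are all hypothetical and chosen only so that the flow visibly converges.

```python
def gradient_flow(grad, f0, step=0.1, iters=200):
    """Explicit Euler discretization of df_t/dt = -grad P(f_t).

    `grad` returns the gradient of the penalty at a point of R^n, standing in
    for the finite-dimensional manifold M used for exposition in the text.
    """
    f = list(f0)
    for _ in range(iters):
        g = grad(f)
        f = [fi - step * gi for fi, gi in zip(f, g)]
    return f


# Toy penalty P(f) = (f_1 - 1)^2 + 2 (f_2 + 0.5)^2 with minimum at (1, -0.5).
def grad_P(f):
    return [2.0 * (f[0] - 1.0), 4.0 * (f[1] + 0.5)]


f_final = gradient_flow(grad_P, f0=[0.5, 0.5])
```

Starting from the neutral point (0.5, 0.5), the iterates follow the steepest descent direction and settle at the unique minimum of the toy penalty.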

Similar gradient flow procedures are widely used in variational problems, such as level set methods (Osher & Sethian, 1988; Sethian, 1999), the Mumford-Shah functional (Mumford & Shah, 1989), etc. In the classification literature, Varshney & Willsky (2010) were the first to use gradient flow methods for solving level set based energy functions, followed by Lin et al. (2012; 2015) for solving Euler’s Elastica models. In our case, we are exploiting the geometry in the space $\mathcal{X} \times \Delta^{L-1}$, rather than standard vector spaces.

Since our gradient flow method is actually applied on the infinite dimensional manifold $\mathcal{M}$, we have to understand both the topology and the Riemannian geometry of $\mathcal{M}$. For the topology, we put the Fréchet topology on $\mathcal{M}' = \mathrm{Maps}(\mathcal{X}, \mathbb{R}^L)$, the set of smooth maps from $\mathcal{X}$ to $\mathbb{R}^L$, and take the induced topology on $\mathcal{M}$. Intuitively speaking, two functions in $\mathcal{M}$ are close if the functions and all their partial derivatives are pointwise close. Since $\mathcal{M}$ is an open Fréchet submanifold with boundary inside the vector space $\mathcal{M}'$, as with an open set in Euclidean space, we can canonically identify $T_f\mathcal{M}$ with $\mathcal{M}'$. For the Riemannian metric, we take the $L^2$ metric on each tangent space $T_f\mathcal{M}$: $\langle \phi_1, \phi_2 \rangle := \int_{\mathcal{X}} \phi_1(x)\phi_2(x)\, d\mathrm{vol}_x$, with $\phi_i \in \mathcal{M}'$ and $d\mathrm{vol}_x$ being the volume form of the induced Riemannian metric on the graph of $f$. (Strictly speaking, the volume form is pulled back to $\mathcal{X}$ by $f$, usually denoted by $f^* d\mathrm{vol}$.)

The differential $d\mathcal{P}_f$ is linear as above, and by a direct calculation, there is a unique tangent vector $\nabla\mathcal{P}_f \in T_f\mathcal{M}$ such that $d\mathcal{P}_f(\phi) = \langle \nabla\mathcal{P}_f, \phi \rangle$ for all $\phi \in T_f\mathcal{M}$. Thus, we can construct the gradient flow equation. However, unlike the case of finite dimensions, the existence of flow lines is not automatic. Assuming the existence of flow lines, a generic initial point flows to a local minimum of $\mathcal{P}$. In any case, our RBF-based implementation in §3, which mimics gradient flow, is well defined.

Note that we think of $\mathcal{X}$ as large enough so that the training data is actually sampled well inside $\mathcal{X}$. This allows us to treat $\mathcal{X}$ as a closed manifold in our gradient calculations, so that boundary effects can be ignored. A similar natural boundary condition is also adopted by previous work (Varshney & Willsky, 2010; Lin et al., 2012; 2015).

2.3. More on related work 

Some other works are related to certain aspects of our work. Most notably, Sobolev regularization involves functional norms of a certain number of derivatives of the prediction function. For instance, the manifold regularization (Belkin et al., 2006) mentioned in §1 uses a Sobolev regularization term,

$$\int_{x \in \mathcal{M}} \|\nabla_{\mathcal{M}} f\|^2 \, dP(x), \tag{2}$$

where $f$ is a smooth function on the manifold $\mathcal{M}$. A discrete version of (2) corresponds to the graph Laplacian regularization (Zhou & Schölkopf, 2005). Lin et al. (2015) discussed in detail the difference between a Sobolev norm and a curvature-based norm for the purpose of exploiting the geometry of the decision boundary.

For our purpose, while imposing, say, a high Sobolev norm¹ will also lead to a flattening of the hypersurface $\mathrm{gr}(f)$, these norms are not specifically tailored to measuring the flatness of $\mathrm{gr}(f)$. In other words, a high Sobolev norm bound will imply the volume bound we desire, but not vice versa. As a result, imposing high Sobolev norm constraints (regardless of computational difficulties) overshrinks the hypothesis space from a learning theory point of view. In contrast, our regularization term (given in (11)) involves only the combination of first derivatives of $f$ that specifically addresses the geometry behind the “small local oscillation” prior observed in practice.

Our training procedure for finding the optimal graph of a function is, in a general sense, also related to the manifold learning problem (Tenenbaum et al., 2000; Roweis & Saul, 2000; Belkin & Niyogi, 2003; Donoho & Grimes, 2003; Zhang & Zha, 2005; Lin & Zha, 2008). The most closely related work is (Donoho & Grimes, 2003), which seeks a flat submanifold of Euclidean space that contains a dataset. Again, there are key differences. Since the goal of (Donoho & Grimes, 2003) is dimensionality reduction, their manifold has high codimension, while our functional graph has codimension $L - 1$, which may be as low as 1. More importantly, we do not assume that the graph of our target function is a flat (or volume minimizing) submanifold, and we instead flow towards a function whose graph is as flat (or volume minimizing) as possible. In this regard, our work is related to a large body of literature on Morse theory in finite and infinite dimensions, and on mean curvature flow (Chen et al., 1999; Mantegazza, 2011).

3. Example Formulation: RBFs 

We now illustrate our approach using an RBF representation of our estimator $f$. RBFs are also used by previous geometric classification methods (Varshney & Willsky, 2010; Lin et al., 2012; 2015).

Given that the values of $f$ are probabilistic vectors, it is common to represent $f$ as a “softmax” output of RBFs, i.e.

$$f^j = \frac{e^{h^j}}{\sum_{l=1}^{L} e^{h^l}}, \quad \text{where } h^j = \sum_{i=1}^{m} a_i^j \varphi_i(x), \quad \text{for } j = 1, \ldots, L, \tag{3}$$

where $\varphi_i(x) = e^{-c\|x - x_i\|^2}$ is the RBF function centered at training sample $x_i$, with kernel width parameter $c$.

Estimating $f$ becomes an optimization problem for the $m \times L$ coefficient matrix $A = (a_i^j)$. The following equation determines $A$:

$$[h(x_1), \ldots, h(x_m)]^T = G A, \quad \text{where } G_{ij} = \varphi_j(x_i). \tag{4}$$
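The representation (3) and the matrix in (4) can be sketched as follows. This is an illustration under stated assumptions: the kernel is taken to be exp(-c * ||x - x_i||^2), so how `c` enters is a guess, and the centers, the coefficient matrix `A` and all function names are hypothetical.

```python
import math


def rbf(x, center, c):
    """phi_i(x) = exp(-c * ||x - x_i||^2); the exact role of c is an assumption."""
    sq = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-c * sq)


def softmax(h):
    m = max(h)                                  # subtract max for stability
    e = [math.exp(v - m) for v in h]
    s = sum(e)
    return [v / s for v in e]


def estimator(x, centers, A, c):
    """Evaluate f(x) from (3): softmax of h^j(x) = sum_i a_i^j phi_i(x)."""
    phi = [rbf(x, xi, c) for xi in centers]
    L = len(A[0])
    h = [sum(A[i][j] * phi[i] for i in range(len(centers))) for j in range(L)]
    return softmax(h)


def gram_matrix(centers, c):
    """G_ij = phi_j(x_i), the m x m matrix appearing in (4)."""
    return [[rbf(xi, xj, c) for xj in centers] for xi in centers]


centers = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]   # hypothetical training inputs
A = [[0.5, -0.2, 0.1], [0.0, 0.3, -0.4], [0.2, 0.2, 0.2]]  # m x L coefficients
f_val = estimator([0.5, 0.5], centers, A, c=1.0)
G = gram_matrix(centers, c=1.0)
```

Whatever the coefficients, the softmax guarantees that `f_val` is a point of the simplex, which is why the representation suits a class probability estimator.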

¹ “High Sobolev norm” is the conventional term for a Sobolev norm with a high order of derivatives.
To plug this RBF representation into our gradient flow scheme, the gradient vector field $\nabla\mathcal{P}_f$ is evaluated at each sample point $x_i$, and $A$ is updated by

$$A \leftarrow A - \tau G^{-1} [\nabla\mathcal{P}_h(x_1), \ldots, \nabla\mathcal{P}_h(x_m)]^T, \tag{5}$$

where $\tau$ is the step-size parameter, and

$$\nabla\mathcal{P}_h(x_i) = \left[\frac{\partial f}{\partial h}\right]^T_{x_i} \nabla\mathcal{P}_f(x_i). \tag{6}$$

Here $\nabla\mathcal{P}_h(x_i)$ denotes the gradient vector field w.r.t. $h$ evaluated at $x_i$, and the $L \times L$ Jacobian matrix $\left[\frac{\partial f}{\partial h}\right]_{x_i}$ can be obtained in closed form from (3). In the following subsections, we give exact forms of the empirical penalty $\mathcal{P}_{T_m}$ and the geometric penalty $\mathcal{P}_G$, and discuss the computation of $\nabla\mathcal{P}_h$ for both penalty terms.
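For the softmax in (3), the closed-form Jacobian is the familiar expression with entries f^l (delta_lk - f^k), and (6) amounts to multiplying a gradient by its transpose. A small sketch, with hypothetical input values and function names:

```python
def softmax_jacobian(f):
    """L x L Jacobian of the softmax: d f^l / d h^k = f^l (delta_lk - f^k)."""
    L = len(f)
    return [[f[l] * ((1.0 if l == k else 0.0) - f[k]) for k in range(L)]
            for l in range(L)]


def pull_back(f, grad_f):
    """Apply (6): grad_h = (df/dh)^T grad_f at one sample point."""
    J = softmax_jacobian(f)
    L = len(f)
    return [sum(J[l][k] * grad_f[l] for l in range(L)) for k in range(L)]


f = [0.2, 0.5, 0.3]           # hypothetical value f(x_i) in the simplex
grad_f = [1.0, -2.0, 0.5]     # hypothetical gradient with respect to f
grad_h = pull_back(f, grad_f)
```

Because every row of the Jacobian sums to zero when f lies in the simplex, the components of `grad_h` always sum to zero: shifting all of h by a constant leaves f unchanged, so there is nothing to gain in that direction.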


3.1. The empirical penalty $\mathcal{P}_{T_m}$

We consider two widely used loss functions for the empirical penalty term $\mathcal{P}_{T_m}$.

Quadratic loss. Since $\mathcal{P}_{T_m}$ measures the deviation of $\mathrm{gr}(f)$ from the mapped training points, it is natural to choose the quadratic function of the Euclidean distance in the simplex $\Delta^{L-1}$,

$$\mathcal{P}_{T_m}(f) = \sum_{i=1}^{m} \|f(x_i) - z_i\|^2, \tag{7}$$

where $z_i$ is the one-hot vector corresponding to the ground truth label of $x_i$. The gradient vector w.r.t. $f$ evaluated at $x_i$ is

$$\nabla\mathcal{P}_{T_m,f}(x_i) = 2(f(x_i) - z_i).$$

The gradient vector w.r.t. $h$ evaluated at $x_i$ is

$$\nabla\mathcal{P}_{T_m,h}(x_i) = 2\left[\frac{\partial f}{\partial h}\right]^T_{x_i} (f(x_i) - z_i), \tag{8}$$

where the evaluation of $\left[\frac{\partial f}{\partial h}\right]_{x_i}$ is the same as in (6).

Cross-entropy loss. The cross-entropy loss function is also widely used for probabilistic output in classification,

$$\mathcal{P}_{T_m}(f) = -\sum_{i=1}^{m} \sum_{\ell=1}^{L} z_i^\ell \log f^\ell(x_i), \tag{9}$$

whose gradient vector field w.r.t. $h$ evaluated at $x_i$ is

$$\nabla\mathcal{P}_{T_m,h}(x_i) = f(x_i) - z_i. \tag{10}$$

3.2. The geometric penalty $\mathcal{P}_G$

As discussed in §2, we wish to penalize graphs for excessive curvature, and we use the following function, which measures the volume of $\mathrm{gr}(f)$:

$$\mathcal{P}_G(f) = \int_{\mathrm{gr}(f)} d\mathrm{vol} = \int_{\mathrm{gr}(f)} \sqrt{\det(g)}\, dx^1 \ldots dx^N, \tag{11}$$

where $g = (g_{ij})$ with $g_{ij} = \delta_{ij} + f^a_i f^a_j$ is the Riemannian metric on $\mathrm{gr}(f)$ induced from the standard dot product on $\mathbb{R}^{N+L}$. We use the summation convention on repeated indices. Note that this regularization term is clearly very different from the standard Sobolev norm of any order.

It is standard that $\nabla\mathcal{P}_G = -\mathrm{Tr\,II} \in \mathbb{R}^{N+L}$ on the space of all embeddings of $\mathcal{X}$ in $\mathbb{R}^{N+L}$. If we restrict to the submanifold of graphs of $f \in \mathcal{M}'$, it is easy to calculate that the gradient of the geometric penalty (11) is

$$\nabla\mathcal{P}_{G,f} = V_{G,f} = -\mathrm{Tr\,II}^L, \tag{12}$$

where $\mathrm{Tr\,II}^L$ denotes the last $L$ components of $\mathrm{Tr\,II}$. Then the geometric gradient w.r.t. $h$ is

$$\nabla\mathcal{P}_{G,h} = V_{G,h} = -\left[\frac{\partial f}{\partial h}\right]^T \mathrm{Tr\,II}^L. \tag{13}$$

Evaluation of $\left[\frac{\partial f}{\partial h}\right]$ and $\mathrm{Tr\,II}^L$ at $x_i$ leads to $\nabla\mathcal{P}_{G,h}(x_i)$.

The formulation given above is general in that it encompasses both the binary and the multiclass cases. For both cases, evaluation of $\left[\frac{\partial f}{\partial h}\right]$ at the training points is the same as that in (6), and evaluation of $\mathrm{Tr\,II}^L$ at any point $x$ can be performed explicitly by the following theorem.

Theorem 1. For $f : \mathbb{R}^N \to \Delta^{L-1}$, $\mathrm{Tr\,II}^L$ for $\mathrm{gr}(f)$ is given by

$$\mathrm{Tr\,II}^L = g^{ij}\left(f^1_{ij} - g^{rs} f^a_{ij} f^a_r f^1_s, \; \ldots, \; f^L_{ij} - g^{rs} f^a_{ij} f^a_r f^L_s\right), \tag{14}$$

where $(g^{ij}) = g^{-1}$ and $f^a_i$, $f^a_{ij}$ denote partial derivatives of $f^a$.

The proof is in Appendix A. Note that for our RBF representation (3), the partial derivatives $f^a_i$, $f^a_{ij}$ can be easily obtained in closed form.

Simplex constraint. The class probability estimator $f : \mathcal{X} \to \Delta^{L-1}$ always takes values in $\Delta^{L-1} \subset \mathbb{R}^L$. While this constraint is automatically satisfied for the flow of the empirical gradient vector formulas (8) and (10), it may fail for the flow of the geometric gradient vector formula (12). There are two ways to enforce this constraint for the geometric gradient vector field. First, since our initial function $f_0$ takes values at the center of $\Delta^{L-1}$, we can orthogonally project the geometric gradient vector $V_{G,f}$ to $\hat{V}_{G,f}$ in the tangent space $Z = \{(y^1, \ldots, y^L) \in \mathbb{R}^L : \sum_{\ell=1}^{L} y^\ell = 0\}$ of the simplex, and then scale $\tau \hat{V}_{G,f}$ ($\tau$ is the step size) to ensure that the range of the new $f_1$ lies in $\Delta^{L-1}$. We then iterate. More simply, we can select $L - 1$ of the $L$ components of $f(x)$, call the new function $f' : \mathcal{X} \to \mathbb{R}^{L-1}$, and compute the $(L - 1)$-dimensional gradient vector $V_{G,f'}$ following (12) and (14). The omitted component of the desired $L$-gradient vector is determined by $-\sum_{\ell=1}^{L-1} V^\ell_{G,f'}$, by the definition of $Z$. The implementation reported here follows this second approach, where we choose the $(L - 1)$ components of $f$ by omitting the component corresponding to the class with the least number of training samples.
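The geometric gradient and the two ways of keeping it tangent to the simplex can be sketched for a one-dimensional feature space (N = 1). Two caveats apply. First, the curvature expression below is the standard mean curvature vector of a graph, (f''^b - (f'' . f') f'^b / g) / g with g = 1 + |f'|^2, which is assumed here to be what Theorem 1 reduces to for N = 1. Second, all function names and numerical values are hypothetical.

```python
def tr_ii_last(fp, fpp):
    """Last L components of Tr II for a graph over a 1-D feature space (N = 1).

    fp and fpp hold the first and second derivatives of (f^1, ..., f^L) at one
    point. With g = 1 + sum_a (fp^a)^2 the expression used here is
    (fpp^b - (sum_a fpp^a fp^a) fp^b / g) / g.
    """
    g = 1.0 + sum(d * d for d in fp)
    inner = sum(a * b for a, b in zip(fpp, fp))
    return [(d2 - inner * d1 / g) / g for d1, d2 in zip(fp, fpp)]


def project_to_tangent(v):
    """First approach: orthogonal projection onto Z = {y : sum_l y^l = 0}."""
    mean = sum(v) / len(v)
    return [vi - mean for vi in v]


def complete_gradient(v_reduced, omitted):
    """Second approach: rebuild the L-vector from L - 1 computed components.

    The omitted component is minus the sum of the others, so the result lies
    in Z. `omitted` is the 0-based index of the dropped class.
    """
    full = list(v_reduced)
    full.insert(omitted, -sum(v_reduced))
    return full


curv = tr_ii_last([1.0], [1.0])                 # plane curve with f' = f'' = 1
v_proj = project_to_tangent([0.3, -0.1, 0.4])
v_full = complete_gradient([0.25, -0.75], omitted=1)
```

For a plane curve the first function reproduces the classical value f'' / (1 + f'^2)^2 for the vertical component of the curvature vector, which gives a quick sanity check on the expression.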


3.3. Algorithm summary 

Algorithm 1 gives a summary of the classifier learning procedure. Input to the algorithm is the training set $T_m$, the RBF kernel width $c$, the trade-off parameter $\lambda$, and the step-size parameter $\tau$. For initialization, our algorithm first initializes the function values of $h$ and $f$ for every training point, and then constructs the matrix $G$ and solves for $A$ by (4). In the subsequent steps, at each iteration, our algorithm first evaluates the gradient vector field $\nabla\mathcal{P}_h$ at every training point, then updates the coefficient matrix $A$ by (5). For the overall penalty function $\mathcal{P} = \mathcal{P}_{T_m} + \lambda\mathcal{P}_G$, we compute the total gradient vector field $\nabla\mathcal{P}_h$ evaluated at $x_i$,

$$\nabla\mathcal{P}_h(x_i) = \nabla\mathcal{P}_{T_m,h}(x_i) + \lambda \nabla\mathcal{P}_{G,h}(x_i) = \begin{cases} \left[\frac{\partial f}{\partial h}\right]^T_{x_i} \left(2(f(x_i) - z_i) - \lambda\, \mathrm{Tr\,II}^L\right), & \text{quadratic}, \\ f(x_i) - z_i - \lambda \left[\frac{\partial f}{\partial h}\right]^T_{x_i} \mathrm{Tr\,II}^L, & \text{cross-entropy}. \end{cases} \tag{15}$$

Our algorithm iterates until it converges or reaches the maximum iteration number.


The same algorithm applies to both the quadratic loss and the cross-entropy loss. To evaluate the total gradient vectors $\nabla\mathcal{P}_h(x_i)$ in each iteration, for the quadratic loss we use (8) and (13) to compute the total gradient vector (15); for the cross-entropy loss, we use (10) and (13) instead. The remaining steps of the procedure are exactly the same for both loss functions.
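The way the two loss-specific gradients combine with the geometric term can be sketched at a single training point. The sketch assumes the softmax Jacobian f^l (delta_lk - f^k) for the representation (3), takes the curvature vector Tr II^L as a given input, and uses hypothetical names and values throughout.

```python
def jt_apply(f, v):
    """(df/dh)^T v for the softmax, using d f^l / d h^k = f^l (delta_lk - f^k)."""
    dot = sum(fl * vl for fl, vl in zip(f, v))
    return [fk * (vk - dot) for fk, vk in zip(f, v)]


def total_gradient(f, z, tr_ii, lam, loss="quadratic"):
    """Total gradient w.r.t. h at one training point, following (8), (10), (13)."""
    geo = jt_apply(f, tr_ii)                          # (df/dh)^T Tr II^L
    resid = [fl - zl for fl, zl in zip(f, z)]
    if loss == "quadratic":
        emp = [2.0 * v for v in jt_apply(f, resid)]   # empirical part, (8)
    elif loss == "cross-entropy":
        emp = resid                                   # empirical part, (10)
    else:
        raise ValueError("unknown loss")
    return [e - lam * g for e, g in zip(emp, geo)]    # minus sign from (13)


f = [0.2, 0.5, 0.3]
z = [0.0, 1.0, 0.0]
tr_ii = [0.0, 0.0, 0.0]     # hypothetical Tr II^L; zero where gr(f) is flat
g_q = total_gradient(f, z, tr_ii, lam=1.0, loss="quadratic")
g_ce = total_gradient(f, z, tr_ii, lam=1.0, loss="cross-entropy")
```

In both cases the component for the true class is negative, so a step along the negative gradient raises the corresponding entry of h, as expected.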


The final predictor learned by our algorithm is given by

$$F(x) = \operatorname{argmax}\{f^\ell(x), \ell \in \{1, 2, \ldots, L\}\}. \tag{16}$$


4. Experiments 

To evaluate the effectiveness of the proposed regularization approach, we compare our RBF-based implementation with two groups of related classification methods. The first


Algorithm 1 Geometric regularized classification

Input: training data $T_m = \{(x_i, y_i)\}_{i=1}^{m}$, RBF kernel width $c$, trade-off parameter $\lambda$, step-size $\tau$

Initialize: $h(x_i) = (1, \ldots, 1)$, $f(x_i) = (\frac{1}{L}, \ldots, \frac{1}{L})$, $\forall i \in \{1, \ldots, m\}$; construct matrix $G$ and solve for $A$ by (4)

for t = 1 to T do

- Evaluate the total gradient vector $\nabla\mathcal{P}_h(x_i)$ at every training point according to (15).

- Update $A$ by (5).

end for

Output: class probability estimator $f$ given by (3).
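The loop structure of Algorithm 1 can be sketched directly on the values H[i] = h(x_i) at the training points: since these values equal G A by (4), the update (5) on A changes them by H <- H - tau * grad. The sketch below uses the cross-entropy loss and, as a loudly stated simplification, drops the geometric term unless a hypothetical callback `tr_ii` supplying Tr II^L is passed in. It is not the authors' implementation.

```python
import math


def softmax(h):
    m = max(h)
    e = [math.exp(v - m) for v in h]
    s = sum(e)
    return [v / s for v in e]


def train_values(labels, L, tau=0.1, T=5, tr_ii=None, lam=0.0):
    """Run the loop of Algorithm 1 on the values H[i] = h(x_i) (cross-entropy loss).

    Because [h(x_1), ..., h(x_m)]^T = G A by (4), the update (5) on A changes
    the training-point values by H <- H - tau * grad. `tr_ii(H)` is a
    hypothetical callback returning Tr II^L at every training point; when it
    is None the geometric term is dropped, which is a simplification.
    """
    m = len(labels)
    H = [[1.0] * L for _ in range(m)]            # h(x_i) = (1, ..., 1)
    for _ in range(T):
        F = [softmax(h) for h in H]
        curv = tr_ii(H) if tr_ii is not None else [[0.0] * L for _ in range(m)]
        for i in range(m):
            f = F[i]
            dot = sum(fl * cl for fl, cl in zip(f, curv[i]))
            for l in range(L):
                z = 1.0 if labels[i] == l + 1 else 0.0
                geo = f[l] * (curv[i][l] - dot)   # [(df/dh)^T Tr II^L]_l
                H[i][l] -= tau * ((f[l] - z) - lam * geo)
    return [softmax(h) for h in H]


F = train_values(labels=[1, 3, 2, 1], L=3)
pred = [max(range(3), key=lambda l: f[l]) + 1 for f in F]
```

With the geometric term switched off, the loop simply pushes each training point's probability vector from the neutral value toward its one-hot target; the regularizer's role is to counteract this pull where it would make the graph oscillate.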


group of methods are standard RBF-based methods that use different regularizers than ours. The second group of methods are previous geometric regularization methods.

In particular, the first group includes the Radial Basis Function Network (RBN), SVM with RBF kernel (SVM) and the Import Vector Machine (IVM) (Zhu & Hastie, 2005) (a greedy search variant of the standard RBF kernel logistic regression classifier). Note that both SVM and IVM use RKHS regularizers, and the IVM also uses a cross-entropy loss similar to that of Ours-CE.

The second group includes the Level Learning Set classifier (Cai & Sowmya, 2007) (LLS), the Geometric Level Set classifier (Varshney & Willsky, 2010) (GLS) and the Euler’s Elastica classifier (Lin et al., 2012; 2015) (EE). Note that both GLS and EE use RBF representations, and EE also uses the same quadratic distance loss as Ours-Q.

We test both the quadratic loss version (Ours-Q) and the cross-entropy loss version (Ours-CE) of our implementation.

4.1. UCI datasets 

We tested our classification method on four binary classification datasets and four multiclass classification datasets. Given that Varshney & Willsky (2010) cover several of the methods in our comparison and their implementation is publicly available, we use the same datasets as (Varshney & Willsky, 2010) and carefully follow their exact experimental setup. Tenfold cross-validation error is reported. For each of the ten folds, the kernel-width constant $c$ and tradeoff parameter $\lambda$ are found using fivefold cross-validation on the training folds. All dimensions of input sample points are normalized to a fixed range $[0, 1]$ throughout the experiments. We select $c$ from the set of values $\{1/2^5, 1/2^4, 1/2^3, 1/2^2, 1/2, 1, 2, 4, 8\}$ and $\lambda$ from the set of values $\{1/1.5^4, 1/1.5^3, 1/1.5^2, 1/1.5, 1, 1.5\}$ so as to minimize the fivefold cross-validation error. The step-size $\tau = 0.1$ and iteration number $T = 5$ are fixed over all
Table 1. Tenfold cross-validation error rate (percent) on four binary and four multiclass classification datasets from the UCI machine learning repository. (L, N) denote the number of classes and input feature dimensions respectively. We compare both the quadratic loss version (Ours-Q) and the cross-entropy loss version (Ours-CE) of our method with 6 RBF-based classification methods and (or) geometric regularization methods: SVM with RBF kernel (SVM), Radial basis function network (RBN), Level learning set classifier (Cai & Sowmya, 2007) (LLS), Geometric level set classifier (Varshney & Willsky, 2010) (GLS), Import Vector Machine (Zhu & Hastie, 2005) (IVM), Euler’s Elastica classifier (Lin et al., 2012; 2015) (EE). The mean error rate averaged over all eight datasets is shown in the bottom row. Top performance for each dataset is shown in bold.

| Dataset (L, N) | RBN | SVM | IVM | LLS | GLS | EE | Ours-Q | Ours-CE |
|---|---|---|---|---|---|---|---|---|
| Pima (2, 8) | 24.60 | 24.12 | 24.11 | 29.94 | 25.94 | **23.33** | 23.98 | 24.51 |
| WDBC (2, 30) | 5.79 | 2.81 | 3.16 | 6.50 | 4.40 | **2.63** | **2.63** | **2.63** |
| Liver (2, 6) | 35.65 | 28.66 | 29.25 | 37.39 | 37.61 | 26.33 | **25.74** | 26.31 |
| Ionos. (2, 34) | 7.38 | **3.99** | 21.73 | 13.11 | 13.67 | 6.55 | 6.83 | 6.26 |
| Wine (3, 13) | 1.70 | 1.11 | 1.67 | 5.03 | 3.92 | 0.56 | **0.00** | **0.00** |
| Iris (3, 4) | 4.67 | **2.67** | 4.00 | 3.33 | 6.00 | 4.00 | 3.33 | 3.33 |
| Glass (6, 9) | 34.50 | 31.77 | **29.44** | 38.77 | 36.95 | 32.28 | 29.87 | **29.44** |
| Segm. (7, 19) | 13.07 | 3.81 | 3.64 | 14.40 | 4.03 | 8.80 | **2.47** | 2.73 |
| ALL-AVG | 15.92 | 12.37 | 14.63 | 18.56 | 16.57 | 13.06 | 11.86 | 11.90 |


datasets. We used the same settings for both loss functions. 
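The parameter search described above can be sketched as a plain grid search. The candidate values, the step size and the iteration count are those stated in the text; the function `select_parameters` and the callable `cv_error` are hypothetical stand-ins for whatever routine performs the fivefold cross-validation.

```python
from itertools import product

# Candidate values as listed in the text.
c_grid = [2.0 ** k for k in range(-5, 0)] + [1.0, 2.0, 4.0, 8.0]
lambda_grid = [1.5 ** k for k in range(-4, 2)]

TAU = 0.1   # fixed step size
T = 5       # fixed number of iterations


def select_parameters(cv_error):
    """Return the (c, lambda) pair with the lowest cross-validation error.

    `cv_error(c, lam)` is a hypothetical callable that would run fivefold
    cross-validation on the training folds and return the error rate.
    """
    return min(product(c_grid, lambda_grid), key=lambda p: cv_error(*p))


# Stand-in error surface, used only to exercise the search.
best = select_parameters(lambda c, lam: abs(c - 0.5) + abs(lam - 1.0))
```

The grid has 9 x 6 = 54 candidate pairs, so the inner cross-validation is run 54 times for each of the ten outer folds.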

Table 1 reports the results of this experiment. The top performer for each dataset is marked in bold, and the averaged performance of each method over all testing datasets is summarized in the bottom row. The numbers for RBN, LLS and GLS are copied from Table 1 of (Varshney & Willsky, 2010). Results for SVM and IVM are obtained by running publicly available implementations for SVM (Chang & Lin, 2011) and IVM (Roscher et al., 2012). Results for EE are obtained by running an implementation provided by the authors of (Lin et al., 2012). When running these implementations, we followed the same experimental setup as described above and exhaustively searched for the optimal range for the kernel bandwidth and the trade-off parameter via cross-validation.

As shown in the last row of Table 1, the two versions of our approach are overall the top two performers among all reported methods. In particular, Ours-Q attains top performance on four of the eight benchmarks, and Ours-CE attains top performance on three of the eight. The performance of the two versions of our method is very close, which shows the robustness of our geometric regularization approach across different loss functions for classification. Note that three pairs of comparisons, IVM vs Ours-CE, GLS vs Ours-Q/Ours-CE, and EE vs Ours-Q, are of particular interest. We discuss each in turn.

The IVM method of kernel logistic regression uses the same RBF-based implementation and a very similar cross-entropy loss as our cross-entropy version Ours-CE, and both methods handle the multiclass case inherently. The main difference lies in regularization, i.e., the standard RKHS norm regularizer vs our geometric regularizer. Ours-CE outperforms IVM on six of the eight benchmarks in Table 1, achieves equal performance on one of the remaining two, and is only slightly behind on “PIMA”. The overall superior performance of Ours-CE demonstrates the advantage of the proposed geometric regularization over the standard RKHS norm regularization.

The GLS method uses the same RBF-based implementation as ours and also exploits volume geometry for regularization. As described in §1, however, there are key differences between the two regularization techniques. GLS measures the volume of the decision boundary supported in $\mathcal{X}$, while our approach measures the volume of a submanifold supported in $\mathcal{X} \times \Delta^{L-1}$ that corresponds to the class probability estimator. Our regularization technique handles the binary and multiclass cases in a unified framework, while the decision boundary based techniques, such as GLS (and EE), were inherently designed for the binary case and rely on a binary coding strategy to train $\log_2 L$ decision boundaries to generalize to the multiclass case. In our experiments, both Ours-Q and Ours-CE outperform GLS on all the benchmarks we have tested. This demonstrates the effectiveness of exploiting the geometry of the class probability in addressing the “small local oscillation” requirement for classification.
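
To make the binary coding strategy concrete, the following is a minimal sketch, not the GLS or EE implementation. It assumes the $\log_2 L$ count is rounded up when $L$ is not a power of two, and all function names are illustrative. Each bit of the code defines one binary problem, and the predicted bits are decoded back into a class index.

```python
def num_binary_problems(num_classes: int) -> int:
    """Number of binary decision boundaries needed to code num_classes labels."""
    # (L - 1).bit_length() equals ceil(log2(L)) for L >= 2, without float rounding.
    return max(1, (num_classes - 1).bit_length())


def encode_label(label: int, num_classes: int) -> list[int]:
    """Binary code of a class index, most significant bit first."""
    n_bits = num_binary_problems(num_classes)
    return [(label >> (n_bits - 1 - b)) & 1 for b in range(n_bits)]


def decode_bits(bits: list[int]) -> int:
    """Class index recovered from the outputs of the binary classifiers."""
    label = 0
    for bit in bits:
        label = (label << 1) | bit
    return label
```

When $L$ is not a power of two, some bit patterns decode to an index outside $\{0, \dots, L-1\}$, so a decision-boundary method using such a code needs an extra rule for those patterns. A method that estimates the class probability directly does not meet this problem.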

The EE method of Euler’s Elastica model uses the same RBF-based implementation and the same quadratic loss as our quadratic loss version Ours-Q. The main difference, again, lies in regularization, i.e., a combination of 1-Sobolev norm and curvature penalty on the decision boundary vs our volume penalty on the submanifold corresponding to the class probability estimator. Since EE adopts a combination of sophisticated geometric measures on the decision boundary, which fit specifically the binary case, it achieves top performance on binary datasets. However, as explained in §1, the geometry of the class probability for general classification, which is captured by our approach, cannot be captured by decision boundary based techniques. That is the reason why Ours-Q, a general scheme for both the binary and multiclass case, outperforms EE on all four multiclass datasets, while it still achieves top performance on binary datasets. This again supports our geometric perspective and a regularization approach that exploits the geometry of the class probability.

Table 2. Notations

$h_f(x) = \arg\max_{\ell \in \mathcal{Y}} f^\ell(x)$ : plug-in classifier of $f : \mathcal{X} \to \Delta^{L-1}$

$\Delta^{L-1}$ : the standard $(L-1)$-simplex in $\mathbb{R}^L$

$\eta(x) = (\eta^1(x), \dots, \eta^L(x))$ : class probability: $\eta^\ell(x) = P(y = \ell \mid x)$

$\mathcal{M}$ : $\{f : \mathcal{X} \to \Delta^{L-1} : f \in C^\infty\}$

$\mathcal{M}'$ : $\{f : \mathcal{X} \to \mathbb{R}^L : f \in C^\infty\}$

$T_f\mathcal{M}$ : the tangent space to $\mathcal{M}$ at some $f \in \mathcal{M}$

The graph of $f \in \mathcal{M}$ (or $\mathcal{M}'$) : $\mathrm{gr}(f) = \{(x, f(x)) : x \in \mathcal{X}\}$

$g_{ij} = \delta_{ij} + f^a_i f^a_j$ : the Riemannian metric on $\mathrm{gr}(f)$ induced from the standard dot product on $\mathbb{R}^{N+L}$

$(g^{ij}) = g^{-1}$, with $g = (g_{ij})_{i,j=1,\dots,N}$

$d\mathrm{vol} = \sqrt{\det(g)}\, dx^1 \dots dx^N$ : the volume element on $\mathrm{gr}(f)$

$\{e_i\}_{i=1}^N$ : a smoothly varying orthonormal basis of the tangent spaces $T_{(x,f(x))}\mathrm{gr}(f)$ of the graph of $f$

$\mathrm{Tr\,II}$ : the trace of the second fundamental form of $\mathrm{gr}(f)$, $\mathrm{Tr\,II} \in \mathbb{R}^{N+L}$

$\mathrm{Tr\,II} = \left(\sum_{i=1}^N D_{e_i} e_i\right)^{\perp}$ : with $\perp$ the orthogonal projection to the subspace perpendicular to the tangent space of $\mathrm{gr}(f)$ and $D_y w$ the directional derivative of $w$ in the $y$ direction

$\mathrm{Tr\,II}^L$ : the projection of $\mathrm{Tr\,II}$ onto the last $L$ coordinates of $\mathbb{R}^{N+L}$

$\nabla\mathcal{P}$ : the gradient vector field of a function $\mathcal{P} : \mathcal{M} \to \mathbb{R}$ on a possibly infinite dimensional manifold $\mathcal{M}$

4.2. Real-world datasets 

To test the scalability of our method to high dimensional and large-scale problems, we also conduct experiments on two real-world datasets, i.e., the Flickr Material Database (FMD) for image classification and the MNIST Database of handwritten digits (MNIST).

FMD (4096 dimensional). The FMD dataset contains 10 categories of images with 100 images per category. We extract image features using the SIFT descriptor augmented by its feature coordinates, implemented by the VLFeat library (VLFeat). With this descriptor, Bag-of-visual-words uses 4096 vector-quantized visual words, histogram square rooting, followed by L2 normalization. We compare our method with an SVM classifier with RBF kernels, using exactly the same 4096 dimensional feature. Our method achieves a correct classification rate of 48.8% while the SVM baseline achieves 46.4%. Note that while recent works (Qi et al., 2015; Cimpoi et al., 2015) report better performance on this dataset, their effort focuses on better feature design, not on the classifier itself. The features used in those works, such as local texture descriptors and CNN features, are more sophisticated.
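
The last two steps of the feature pipeline, histogram square rooting followed by L2 normalization, can be sketched as below. This is an illustration only, not the VLFeat code path. The function name and the `eps` guard against an all-zero histogram are assumptions.

```python
import numpy as np


def normalize_bow_histogram(counts, eps=1e-12):
    """Square-root a visual-word histogram, then scale it to unit L2 norm."""
    rooted = np.sqrt(np.asarray(counts, dtype=float))
    norm = np.linalg.norm(rooted)
    # eps avoids division by zero for an image with no detected visual words.
    return rooted / max(norm, eps)
```

Square rooting dampens the influence of a few very frequent visual words before the vector is scaled to unit length.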

MNIST (60,000 samples). The MNIST dataset contains 10 classes (0 ~ 9) of handwritten digits with 60,000 samples for training and 10,000 samples for testing. Each sample is a 28 x 28 grey scale image. We use 1000 RBFs to represent our function $f$, with RBF centers obtained by applying K-means clustering on the training set. Note that our learning and regularization approach still handles all the 60,000 training samples as described by Algorithm 1. Our method achieves an error rate of 2.74%. While there are many results reported on this dataset, we feel that the most comparable method with our representation is the Radial Basis Function Network with 1000 RBF units (LeCun et al., 1998), which achieves an error rate of 3.6%. This experiment suggests that our geometric regularization approach can scale to larger datasets.
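
As a rough illustration of representing a function with RBFs centered at K-means centroids, the sketch below uses plain Lloyd iterations and Gaussian basis functions. The bandwidth `sigma`, the iteration count, and the function names are assumptions rather than details from our implementation. The dense distance computation is meant for small arrays only.

```python
import numpy as np


def kmeans_centers(X, n_centers, n_iter=20, seed=0):
    """Plain Lloyd iterations; returns an (n_centers, dim) array of centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_centers, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign each sample to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for c in range(n_centers):
            members = X[assign == c]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[c] = members.mean(axis=0)
    return centers


def rbf_features(X, centers, sigma):
    """Gaussian RBF activations, one column per center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))
```

A function of the form `rbf_features(X, centers, sigma) @ W`, with `W` a hypothetical `(n_centers, L)` coefficient matrix, then has closed-form first and second derivatives in `X`, which is what the regularizer requires.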

5. Discussion 

Our geometric regularization approach can also be viewed as a combination of common physical models. As illustrated in Figures 1 and 2, each training pair $(x, y)$ corresponds to a point at one of the vertices of the simplex associated with $x$. As a result, all training data lie on the boundary of the space $\mathcal{X} \times \Delta^{L-1}$, while the functional graph of a class probability estimator $f$ is a hypersurface (submanifold) in $\mathcal{X} \times \Delta^{L-1}$. An initial estimator without training information corresponds to the flat hyperplane in the neutral position. In response to the presence of the training data, this neutral hypersurface deforms towards the training data, as if attracted by a gravitational force due to point masses centered at the training points. Simultaneously, the regularization term forces the hypersurface to remain as flat (or as volume minimizing) as possible, as if in the presence of surface tension. Thus this term follows the physics of soap films and minimal surfaces (Dierkes et al., 1992). Geometric flows like the one proposed here are often modeled on physical processes. In our case, the flow can be viewed as a mixed gravity and surface tension physical experiment.

6. Conclusion 

We have introduced a new geometric perspective on regularization for classification that exploits the geometry of a robust class probability estimator. Under this perspective, we propose a general regularization approach that applies to both binary and multiclass cases in a unified way. In experiments with an example formulation based on RBFs, our implementation achieves favorable results compared with widely used RBF-based classification methods and previous geometric regularization methods. While experimental results demonstrate the effectiveness of our geometric regularization technique, it is also important to study convergence properties of this approach from a learning theory perspective. As an initial attempt, we have established Bayes consistency for an easy case of empirical penalty function; details are provided in Appendix B. We will continue this study in the future.

References 

Audibert, Jean-Yves and Tsybakov, Alexandre. Fast learning rates for plug-in classifiers. Annals of Statistics, 35(2):608-633, 2007.

Bartlett, Peter L, Jordan, Michael I, and McAuliffe, Jon D. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138-156, 2006.

Belkin, Mikhail and Niyogi, Partha. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373-1396, 2003.

Belkin, Mikhail, Niyogi, Partha, and Sindhwani, Vikas. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399-2434, 2006.

Cai, Xiongcai and Sowmya, Arcot. Level learning set: A novel classifier based on active contour models. In Proc. European Conf. on Machine Learning (ECML), pp. 79-90. 2007.

Chang, Chih-Chung and Lin, Chih-Jen. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Chen, Yun-Gang, Giga, Yoshikazu, and Goto, Shun’ichi. Uniqueness and existence of viscosity solutions of generalized mean curvature flow equations. In Fundamental contributions to the continuum theory of evolving phase interfaces in solids, pp. 375-412. Springer, Berlin, 1999.

Devroye, Luc, Gyorfi, Laszlo, and Lugosi, Gabor. A probabilistic theory of pattern recognition. Springer, 1996.

Dierkes, Ulrich, Hildebrandt, Stefan, Kuster, Albrecht, and Wohlrab, Ortwin. Minimal surfaces. Springer, 1992.

Donoho, David and Grimes, Carrie. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100(10):5591-5596, 2003.

FMD. http://people.csail.mit.edu/celiu/CVPR2010/FMD/. Accessed: 2015-06-01.

Goodfellow, Ian J, Shlens, Jonathon, and Szegedy, Christian. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Guckenheimer, John and Worfolk, Patrick. Dynamical systems: some computational problems. In Bifurcations and periodic orbits of vector fields (Montreal, PQ, 1992), volume 408 of NATO Adv. Sci. Inst. Ser. C Math. Phys. Sci., pp. 241-277. Kluwer Acad. Publ., Dordrecht, 1993.

LeCun, Yann, Bottou, Leon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Lin, Tong and Zha, Hongbin. Riemannian manifold learning. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 30(5):796-809, 2008.

Lin, Tong, Xue, Hanlin, Wang, Ling, and Zha, Hongbin. Total variation and Euler’s elastica for supervised learning. Proc. International Conf. on Machine Learning (ICML), 2012.

Lin, Tong, Xue, Hanlin, Wang, Ling, Huang, Bo, and Zha, Hongbin. Supervised learning via Euler’s elastica models. Journal of Machine Learning Research, 16:3637-3686, 2015.

Mantegazza, Carlo. Lecture Notes on Mean Curvature Flow, volume 290 of Progress in Mathematics. Birkhauser/Springer Basel AG, Basel, 2011.

MNIST. http://yann.lecun.com/exdb/mnist/. Accessed: 2015-06-01.

Mumford, David and Shah, Jayant. Optimal approximations by piecewise smooth functions and associated variational problems. Communications on Pure and Applied Mathematics, 42(5):577-685, 1989.



Class Probability Estimation via Differential Geometric Regularization 


Osher, Stanley and Sethian, James. Fronts propagating with curvature-dependent speed: algorithms based on Hamilton-Jacobi formulations. Journal of Computational Physics, 79(1):12-49, 1988.

Roscher, Ribana, Forstner, Wolfgang, and Waske, Bjorn. I2VM: incremental import vector machines. Image and Vision Computing, 30(4):263-278, 2012.

Roweis, Sam and Saul, Lawrence. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, 2000.

Scholkopf, Bernhard and Smola, Alexander. Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press, 2002.

Sethian, James Albert. Level set methods and fast marching methods: evolving interfaces in computational geometry, fluid mechanics, computer vision, and materials science, volume 3. Cambridge University Press, 1999.

Steinwart, Ingo. Consistency of support vector machines and other regularized kernel classifiers. IEEE Trans. Information Theory, 51(1):128-142, 2005.

Stone, Charles. Consistent nonparametric regression. Annals of Statistics, pp. 595-620, 1977.

Tenenbaum, Joshua, De Silva, Vin, and Langford, John. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, 2000.

Vapnik, Vladimir Naumovich. Statistical learning theory, volume 1. Wiley New York, 1998.

Varshney, Kush and Willsky, Alan. Classification using geometric level sets. Journal of Machine Learning Research, 11:491-516, 2010.

VLFeat. http://www.vlfeat.org/applications/apps.html. Accessed: 2015-06-01.

Zhang, Zhenyue and Zha, Hongyuan. Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. SIAM Journal on Scientific Computing, 26(1):313-338, 2005.

Zhou, Dengyong and Scholkopf, Bernhard. Regularization on discrete spaces. In Pattern Recognition, pp. 361-368. Springer, 2005.

Zhu, Ji and Hastie, Trevor. Kernel logistic regression and the import vector machine. Journal of Computational and Graphical Statistics, 2005.





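A. Proof of Theorem 1

Proof. For $f : \mathbb{R}^N \to \Delta^{L-1} \subset \mathbb{R}^L$,

$$\{ r_j = r_j(x) = (0, \dots, \overset{j}{1}, \dots, 0, f^1_j, \dots, f^L_j) : j = 1, \dots, N \}$$

is a basis of the tangent space $T_x\mathrm{gr}(f)$ to $\mathrm{gr}(f)$. Here $f^i_j = \partial_{x^j} f^i$. Let $\{e_i\}$ be an orthonormal frame of $T_x\mathrm{gr}(f)$. We have

$$e_i = B_i^j r_j$$

for some invertible matrix $B_i^j$.

Define the metric matrix $g$ for the basis $\{r_j\}$ by

$$g = (g_{kj}) \quad \text{with} \quad g_{kj} = r_k \cdot r_j = \delta_{kj} + f^a_k f^a_j.$$

Then

$$\delta_{ij} = e_i \cdot e_j = B_i^k B_j^t\, r_k \cdot r_t = B_i^k B_j^t g_{kt} \;\Rightarrow\; I = (BB^T) g \;\Rightarrow\; BB^T = g^{-1}.$$

Thus $BB^T$ is computable in terms of derivatives of $f$.

Let $D_u w$ be the $\mathbb{R}^{N+L}$ directional derivative of $w$ in the direction $u$. Then

$$\begin{aligned}
\mathrm{Tr\,II} &= P^\nu D_{e_i} e_i = P^\nu D_{B_i^j r_j} B_i^k r_k = B_i^j P^\nu D_{r_j} (B_i^k r_k) \\
&= B_i^j P^\nu [(D_{r_j} B_i^k) r_k] + B_i^j B_i^k P^\nu D_{r_j} r_k \\
&= B_i^j B_i^k P^\nu D_{r_j} r_k \\
&= (g^{-1})^{jk} P^\nu D_{r_j} r_k,
\end{aligned}$$

since $P^\nu r_k = 0$.

We have

$$r_k = (0, \dots, 1, \dots, f^1_k(x^1, \dots, x^N), \dots, f^L_k(x^1, \dots, x^N)),$$

so in particular, $\partial_{x^\ell} r_k = 0$ if $\ell > N$. Thus

$$D_{r_j} r_k = (\overbrace{0, \dots, 0}^{N}, f^1_{kj}, \dots, f^L_{kj}).$$

So far, we have

$$\mathrm{Tr\,II} = (g^{-1})^{jk} P^\nu (\overbrace{0, \dots, 0}^{N}, f^1_{kj}, \dots, f^L_{kj}).$$

Since $g$ is given in terms of derivatives of $f$, we need to write $P^\nu = I - P^T$ in terms of derivatives of $f$. For any $u \in \mathbb{R}^{N+L}$, we have

$$\begin{aligned}
P^T u &= (P^T u \cdot e_i) e_i = (u \cdot B_i^j r_j) B_i^k r_k \\
&= B_i^j B_i^k (u \cdot r_j) r_k \\
&= (g^{-1})^{jk} (u \cdot r_j) r_k.
\end{aligned}$$

Thus

$$\begin{aligned}
\mathrm{Tr\,II} \qquad & (17) \\
&= (g^{-1})^{jk} P^\nu (\overbrace{0, \dots, 0}^{N}, f^1_{kj}, \dots, f^L_{kj}) \qquad (18) \\
&= (g^{-1})^{jk} (\overbrace{0, \dots, 0}^{N}, f^1_{kj}, \dots, f^L_{kj}) - P^T\big[(g^{-1})^{jk} (\overbrace{0, \dots, 0}^{N}, f^1_{kj}, \dots, f^L_{kj})\big] \\
&= (g^{-1})^{jk} (\overbrace{0, \dots, 0}^{N}, f^1_{kj}, \dots, f^L_{kj}) - (g^{-1})^{jk} (g^{-1})^{rs} (f^a_{rs} f^a_j)\, r_k \\
&= (g^{-1})^{ij} \big(0, \dots, \overset{j}{-(g^{-1})^{rs} f^a_{rs} f^a_i}, \dots, 0, \qquad (19) \\
&\qquad\qquad f^1_{ji} - (g^{-1})^{rs} f^a_{rs} f^a_i f^1_j, \dots, f^L_{ji} - (g^{-1})^{rs} f^a_{rs} f^a_i f^L_j \big),
\end{aligned}$$

after a relabeling of indices. Therefore, the last $L$ components of $\mathrm{Tr\,II}$ are given by

$$\mathrm{Tr\,II}^L = (g^{-1})^{ij} \big( f^1_{ji} - (g^{-1})^{rs} f^a_{rs} f^a_i f^1_j, \dots, f^L_{ji} - (g^{-1})^{rs} f^a_{rs} f^a_i f^L_j \big).$$

□

B. An Easy Example with Bayes Consistency

We now give an example with a loss function that enables an easy Bayes consistency proof under a mild initialization assumption. Related notation is summarized in §C.

For ease of reading, we change the notation for the empirical penalty $\mathcal{P}_{\mathcal{T}_m}$ in the Appendix to $\mathcal{P}_D$, i.e., $\mathcal{P} = \mathcal{P}_D + \lambda \mathcal{P}_G$. $\mathcal{P}_D$ measures the deviation of $\mathrm{gr}(f)$ from the mapped training points. A natural geometric distance penalty term is an $L^2$ distance in $\mathbb{R}^L$ from $f(x)$ to the averaged $z$ component of the $k$-nearest training points:

$$\mathcal{P}_D(f) = R_{D,\mathcal{T}_m,k}(f) = \int_{\mathcal{X}} d^2\Big(f(x), \frac{1}{k} \sum_{i=1}^{k} z_i\Big)\, dx,$$

where $d$ is the Euclidean distance in $\mathbb{R}^L$, $z_i$ is the vector of the last $L$ components of $(x_i, z_i) = (x_i^1, \dots, x_i^N, z_i^1, \dots, z_i^L)$, with $x_i$ the $i$th nearest neighbor of $x$ in $\mathcal{T}_m$, and $dx$ is the Lebesgue measure. The gradient vector field is

$$\nabla (R_{D,\mathcal{T}_m,k})_f (x, f(x)) = \frac{2}{k} \sum_{i=1}^{k} (f(x) - z_i).$$

However, $\nabla (R_{D,\mathcal{T}_m,k})_f$ is discontinuous on the set $\mathcal{D}$ of points $x$ such that $x$ has equidistant training points among its $k$ nearest neighbors. $\mathcal{D}$ is the union of $(N-1)$-dimensional hyperplanes in $\mathcal{X}$, so $\mathcal{D}$ has measure zero. Such points will necessarily exist unless the last $L$ components of the mapped training points are all 1 or all 0. To rectify this, we can smooth out $\nabla (R_{D,\mathcal{T}_m,k})_f$ to a vector field

$$V_{D,f,\phi} = \frac{2\phi(x)}{k} \sum_{i=1}^{k} (f(x) - z_i). \qquad (21)$$

Here $\phi(x)$ is a smooth damping function close to the singular function $\delta_{\mathcal{D}}$, which has $\delta_{\mathcal{D}}(x) = 0$ for $x \in \mathcal{D}$ and $\delta_{\mathcal{D}}(x) = 1$ for $x \notin \mathcal{D}$. Outside any open neighborhood of $\mathcal{D}$, $\nabla R_{D,\mathcal{T}_m,k} = V_{D,f,\phi}$ for $\phi$ close enough to $\delta_{\mathcal{D}}$.

Recall the geometric penalty from the submission, i.e., $\mathcal{P}_G(f) = \int_{\mathrm{gr}(f)} d\mathrm{vol}$, with the geometric gradient vector field being $V_{G,f} = -\mathrm{Tr\,II}^L$.

Then the gradient vector field $V_{tot,\lambda,m,f,\phi}$ of this example penalty $\mathcal{P}$ is

$$V_{tot,\lambda,m,f,\phi} = \nabla\mathcal{P}_f = V_{D,f,\phi} + \lambda V_{G,f} = \frac{2\phi(x)}{k} \sum_{i=1}^{k} (f(x) - z_i) - \lambda\, \mathrm{Tr\,II}^L. \qquad (22)$$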
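
The two terms of (22) can be evaluated at a single point as in the sketch below, which assumes the first and second derivatives of $f$ at that point are already available as arrays. The function names and array layouts are illustrative. The $\mathrm{Tr\,II}^L$ routine follows the formula of Theorem 1 with $g_{ij} = \delta_{ij} + f^a_i f^a_j$, and the factor $2/k$ in the data term is our reading of the gradient of the squared distance.

```python
import numpy as np


def trace_second_fundamental_form_L(J, H):
    """Last L components of Tr II at one point of gr(f).

    J[a, i]    = first derivative  f^a_i   (shape L x N)
    H[a, i, j] = second derivative f^a_ij  (shape L x N x N)
    """
    n = J.shape[1]
    g = np.eye(n) + J.T @ J                 # g_ij = delta_ij + f^a_i f^a_j
    g_inv = np.linalg.inv(g)
    c = np.einsum("rs,ars->a", g_inv, H)    # (g^-1)^rs f^a_rs
    # (g^-1)^ij f^l_ij - (g^-1)^ij (g^-1)^rs f^a_rs f^a_i f^l_j
    return c - J @ (g_inv @ (J.T @ c))


def total_vector_field(f_x, neighbor_z, tr_ii_L, lam, phi=1.0):
    """(2 phi / k) sum_i (f(x) - z_i) - lam * Tr II^L at a single point x."""
    k = len(neighbor_z)
    data_term = (2.0 * phi / k) * (f_x[None, :] - neighbor_z).sum(axis=0)
    return data_term - lam * tr_ii_L
```

For $N = L = 1$ the routine reduces to $f''/(1 + f'^2)^2$, the vertical component of the curvature vector of a plane curve, which gives a quick sanity check. A discretized flow would then step against this field, for example `f_x - tau * total_vector_field(...)` with a small hypothetical step size `tau`.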

B.1. Consistency analysis

For a training set $\mathcal{T}_m$, we let $f_{\mathcal{T}_m} = (f^1_{\mathcal{T}_m}, \dots, f^L_{\mathcal{T}_m})$ be the class probability estimator given by our approach. We denote the generalization risk of the corresponding plug-in classifier $h_{f_{\mathcal{T}_m}}$ by $R_P(f_{\mathcal{T}_m}) = E_P[\mathbb{1}_{h_{f_{\mathcal{T}_m}}(x) \neq y}]$. The Bayes risk is defined by $R^*_P = \inf_{h : \mathcal{X} \to \mathcal{Y}} R_P(h) = E_P[\mathbb{1}_{h_\eta(x) \neq y}]$. Our algorithm is Bayes consistent if $\lim_{m \to \infty} R_P(f_{\mathcal{T}_m}) = R^*_P$ holds in probability for all distributions $P$ on $\mathcal{X} \times \mathcal{Y}$. Usually, gradient flow methods are applied to a convex functional, so that a flow line approaches the unique global minimum. If the domain of the functional is an infinite dimensional manifold of (e.g. smooth) functions, we always assume that flow lines exist and that the actual minimum exists in this manifold.
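
For readers who prefer code to notation, the plug-in classifier $h_f$ and a sample estimate of the risk $R_P(f)$ can be sketched as follows. The function names are illustrative, and the risk here is an average over a finite labeled sample rather than the expectation under $P$.

```python
import numpy as np


def plug_in_classifier(probs):
    """h_f(x) = argmax_l f^l(x), applied row-wise to an (n, L) array."""
    return np.argmax(probs, axis=1)


def empirical_risk(probs, labels):
    """Fraction of samples with h_f(x) != y, a sample estimate of R_P(f)."""
    return float(np.mean(plug_in_classifier(probs) != np.asarray(labels)))
```

Applying `empirical_risk` to the true class probabilities $\eta(x)$ on a large sample would estimate the Bayes risk $R^*_P$.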

Because our functionals are not convex, and because we are strictly speaking not working with gradient vector fields, we can only hope to prove Bayes consistency for the set of initial estimators in the stable manifold of a stable fixed point (or sink) of the vector field (Guckenheimer & Worfolk, 1993). Recall that a stable fixed point $f_0$ has a maximal open neighborhood, the stable manifold $S_{f_0}$, on which flow lines tend towards $f_0$. For the manifold $\mathcal{M}$, the stable manifold for a stable critical point of the vector field $V_{tot,\lambda,m,f,\phi}$ is infinite dimensional.

The proof of Bayes consistency for multiclass (including binary) classification follows these steps:

Step 1: $\lim_{\lambda \to 0} R^*_{D,P,\lambda} = 0$.

Step 2: $\lim_{n \to \infty} R_{D,P}(f_n) = 0 \Rightarrow \lim_{n \to \infty} R_P(f_n) = R^*_P$.

Step 3: For all $f \in \mathcal{M} = \mathrm{Maps}(\mathcal{X}, \Delta^{L-1})$, $|R_{D,\mathcal{T}_m}(f) - R_{D,P}(f)| \to 0$ in probability as $m \to \infty$.

Proofs of these steps are provided in the following subsections. For the notation see §C. $R^*_{D,P,\lambda}$ is the minimum over $f$ of the regularized $D$-risk $R_{D,P,\lambda}(f) = R_{D,P}(f) + \lambda \mathcal{P}_G(f)$, with $R_{D,P}(f) = \int_{\mathcal{X}} d^2(f(x), \eta(x))\, dx$ the $D$-risk. Also, $R_{D,\mathcal{T}_m,\lambda}(f) = R_{D,\mathcal{T}_m}(f) + \lambda \mathcal{P}_G(f)$, with $R_{D,\mathcal{T}_m}(f) = \int_{\mathcal{X}} d^2\big(f(x), \frac{1}{k} \sum_{i=1}^{k} z_i\big)\, dx$ the empirical $D$-risk.


Theorem 2 (Bayes Consistency). Let $m$ be the size of the training data set. Let $f_{1,\lambda,m} \in S_{f_{D,\mathcal{T}_m,\lambda}}$, the stable manifold for the global minimum $f_{D,\mathcal{T}_m,\lambda}$ of $R_{D,\mathcal{T}_m,\lambda}$, and let $f_{n,\lambda,m,\phi}$ be a sequence of functions on the flow line of $V_{tot,\lambda,m,f,\phi}$ starting with $f_{1,\lambda,m}$, with the flow time $t_n \to \infty$ as $n \to \infty$. Then $R_P(f_{n,\lambda,m,\phi}) \to R^*_P$ in probability for all distributions $P$ on $\mathcal{X} \times \mathcal{Y}$, if $k/m \to 0$ as $m \to \infty$.


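Proof. In the notation of §C, if $f_{D,\mathcal{T}_m,\lambda}$ is a global minimum for $R_{D,\mathcal{T}_m,\lambda}$, then outside of $\mathcal{D}$, $f_{D,\mathcal{T}_m,\lambda}$ is the limit of critical points for the negative flow of $V_{tot,\lambda,m,f,\phi}$ as $\phi \to \delta_{\mathcal{D}}$. To see this, fix an $\epsilon_i$ neighborhood $\mathcal{D}_{\epsilon_i}$ of $\mathcal{D}$. For a sequence $\phi_j \to \delta_{\mathcal{D}}$, $V_{tot,\lambda,m,f,\phi_j}$ is independent of $j \geq j(\epsilon_i)$ on $\mathcal{X} \setminus \mathcal{D}_{\epsilon_i}$, so we find a function $f_i$, a critical point of $V_{tot,\lambda,m,f,\phi_{j(\epsilon_i)}}$, equal to $f_{D,\mathcal{T}_m,\lambda}$ on $\mathcal{X} \setminus \mathcal{D}_{\epsilon_i}$. Since any $x \notin \mathcal{D}$ lies outside some $\mathcal{D}_{\epsilon_i}$, the sequence $f_i$ converges at $x$ if we let $\epsilon_i \to 0$. Thus we can ignore the choice of $\phi$ in our proof, and drop $\phi$ from the notation.

For our algorithm, for fixed $\lambda, m$, we have as above $\lim_{n \to \infty} f_{n,\lambda,m} = f_{D,\mathcal{T}_m,\lambda}$, so

$$\lim_{n \to \infty} R_{D,\mathcal{T}_m,\lambda}(f_{n,\lambda,m}) = R_{D,\mathcal{T}_m,\lambda}(f_{D,\mathcal{T}_m,\lambda}),$$

for $f_1 \in S_{f_{D,\mathcal{T}_m,\lambda}}$. By Step 2, it suffices to show $R_{D,P}(f_{D,\mathcal{T}_m,\lambda}) \to 0$ as $m \to \infty$ and $\lambda \to 0$. In probability, we have $\forall \delta > 0$, $\exists m > 0$ such that

$$\begin{aligned}
R_{D,P}(f_{D,\mathcal{T}_m,\lambda})
&\leq R_{D,P}(f_{D,\mathcal{T}_m,\lambda}) + \lambda \mathcal{P}_G(f_{D,\mathcal{T}_m,\lambda}) \\
&\leq R_{D,\mathcal{T}_m}(f_{D,\mathcal{T}_m,\lambda}) + \lambda \mathcal{P}_G(f_{D,\mathcal{T}_m,\lambda}) + \frac{\delta}{3} && \text{(Step 3)} \\
&= R_{D,\mathcal{T}_m,\lambda}(f_{D,\mathcal{T}_m,\lambda}) + \frac{\delta}{3} \\
&\leq R_{D,\mathcal{T}_m,\lambda}(f_{D,P,\lambda}) + \frac{\delta}{3} && \text{(minimality of } f_{D,\mathcal{T}_m,\lambda}\text{)} \\
&= R_{D,\mathcal{T}_m}(f_{D,P,\lambda}) + \lambda \mathcal{P}_G(f_{D,P,\lambda}) + \frac{\delta}{3} \\
&\leq R_{D,P}(f_{D,P,\lambda}) + \lambda \mathcal{P}_G(f_{D,P,\lambda}) + \frac{2\delta}{3} && \text{(Step 3)} \\
&= R_{D,P,\lambda}(f_{D,P,\lambda}) + \frac{2\delta}{3} = R^*_{D,P,\lambda} + \frac{2\delta}{3} \\
&\leq \delta, && \text{(Step 1)}
\end{aligned}$$

for $\lambda$ close to zero. Since $R_{D,P}(f_{D,\mathcal{T}_m,\lambda}) \geq 0$, we are done. □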


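B.2. Step 1

Lemma 3. (Step 1) $\lim_{\lambda \to 0} R^*_{D,P,\lambda} = 0$.

Proof. After the smoothing procedure in §3.1 for the distance penalty term, the function $R_{D,P,\lambda} : \mathcal{M} \to \mathbb{R}$ is continuous in the Fréchet topology on $\mathcal{M}$. We check that the functions $R_{D,P,\lambda} : \mathcal{M} \to \mathbb{R}$ are equicontinuous in $\lambda$: for fixed $f_0 \in \mathcal{M}$ and $\epsilon > 0$, there exists $\delta = \delta(f_0, \epsilon)$ such that $|\lambda - \lambda'| < \delta \Rightarrow |R_{D,P,\lambda}(f_0) - R_{D,P,\lambda'}(f_0)| < \epsilon$. This is immediate:

$$|R_{D,P,\lambda}(f_0) - R_{D,P,\lambda'}(f_0)| = |(\lambda - \lambda') \mathcal{P}_G(f_0)| < \epsilon$$

if $\delta < \epsilon / |\mathcal{P}_G(f_0)|$. It is standard that the infimum $\inf R_\lambda$ of an equicontinuous family of functions is continuous in $\lambda$, so $\lim_{\lambda \to 0} R^*_{D,P,\lambda} = R^*_{D,P,\lambda=0} = R_{D,P}(\eta) = 0$. □

B.3. Step 2

We assume that the class probability function $\eta(x) : \mathbb{R}^N \to \mathbb{R}^L$ is smooth, and that the marginal distribution $\mu(x)$ is continuous. We also let $\mu$ denote the corresponding measure on $\mathcal{X}$.

Notation:

$$h_f(x) = \arg\max\{f^\ell(x),\ \ell \in \mathcal{Y}\}.$$

Of course,

$$\mathbb{1}_{h_f(x) \neq y} = \begin{cases} 1, & h_f(x) \neq y, \\ 0, & h_f(x) = y. \end{cases}$$

Lemma 4. (Step 2 for a subsequence)

$$\lim_{n \to \infty} R_{D,P}(f_n) = 0 \;\Rightarrow\; \lim_{i \to \infty} R_P(f_{n_i}) = R^*_P$$

for some subsequence $\{f_{n_i}\}_{i=1}^{\infty}$ of $\{f_n\}$.

Proof. The left hand side of the Lemma is

$$\int_{\mathcal{X}} d^2(f_n(x), \eta(x))\, dx \to 0,$$

which is equivalent to

$$\int_{\mathcal{X}} d^2(f_n(x), \eta(x)) \mu(x)\, dx \to 0, \qquad (23)$$

since $\mathcal{X}$ is compact and $\mu$ is continuous. Therefore, it suffices to show

$$\int_{\mathcal{X}} d^2(f_n(x), \eta(x)) \mu(x)\, dx \to 0 \;\Rightarrow\; E_P[\mathbb{1}_{h_{f_n}(x) \neq y}] \to E_P[\mathbb{1}_{h_\eta(x) \neq y}]. \qquad (24)$$

We recall that $L^2$ convergence implies pointwise convergence a.e. along a subsequence, so (23) implies that a subsequence of $f_n$, also denoted $f_n$, has $f_n \to \eta(x)$ pointwise a.e. on $\mathcal{X}$. (By our assumption on $\mu(x)$, these statements hold for either $\mu$ or Lebesgue measure.) By Egorov’s theorem, for any $\epsilon > 0$, there exists a set $B_\epsilon \subset \mathcal{X}$ with $\mu(B_\epsilon) < \epsilon$ such that $f_n \to \eta(x)$ uniformly on $\mathcal{X} \setminus B_\epsilon$.

Fix $\delta > 0$ and set

$$Z_\delta = \Big\{x \in \mathcal{X} : \#\{\arg\max_{\ell \in \mathcal{Y}} \eta^\ell(x)\} = 1,\ \big|\max_{\ell \in \mathcal{Y}} \eta^\ell(x) - \operatorname{submax}_{\ell \in \mathcal{Y}} \eta^\ell(x)\big| < \delta \Big\},$$

where $\operatorname{submax}_{\ell \in \mathcal{Y}}$ denotes the second largest element in $\{\eta^1(x), \dots, \eta^L(x)\}$. For the moment, assume that $Z_0 = \{x \in \mathcal{X} : \#\{\arg\max_{\ell \in \mathcal{Y}} \eta^\ell(x)\} > 1\}$ has $\mu(Z_0) = 0$.

It follows easily [2] that $\mu(Z_\delta) \to 0$ as $\delta \to 0$. On $\mathcal{X} \setminus (Z_\delta \cup B_\epsilon)$, we have $\mathbb{1}_{h_{f_n}(x) \neq y} = \mathbb{1}_{h_\eta(x) \neq y}$ for $n > N_\delta$. Thus

$$E_P[\mathbb{1}_{\mathcal{X} \setminus (Z_\delta \cup B_\epsilon)} \mathbb{1}_{h_{f_n}(x) \neq y}] = E_P[\mathbb{1}_{\mathcal{X} \setminus (Z_\delta \cup B_\epsilon)} \mathbb{1}_{h_\eta(x) \neq y}].$$

(Here $\mathbb{1}_A$ is the characteristic function of a set $A$.)

As $\delta \to 0$,

$$E_P[\mathbb{1}_{\mathcal{X} \setminus (Z_\delta \cup B_\epsilon)} \mathbb{1}_{h_{f_n}(x) \neq y}] \to E_P[\mathbb{1}_{\mathcal{X} \setminus B_\epsilon} \mathbb{1}_{h_{f_n}(x) \neq y}],$$

and similarly for $f_n$ replaced by $\eta(x)$. During this process, $N_\delta$ presumably goes to $\infty$, but that precisely means

$$\lim_{n \to \infty} E_P[\mathbb{1}_{\mathcal{X} \setminus B_\epsilon} \mathbb{1}_{h_{f_n}(x) \neq y}] = E_P[\mathbb{1}_{\mathcal{X} \setminus B_\epsilon} \mathbb{1}_{h_\eta(x) \neq y}].$$

Since

$$\big| E_P[\mathbb{1}_{\mathcal{X} \setminus B_\epsilon} \mathbb{1}_{h_{f_n}(x) \neq y}] - E_P[\mathbb{1}_{h_{f_n}(x) \neq y}] \big| < \epsilon,$$

and similarly for $\eta(x)$, we get

$$\begin{aligned}
\Big| \lim_{n \to \infty} E_P[\mathbb{1}_{h_{f_n}(x) \neq y}] - E_P[\mathbb{1}_{h_\eta(x) \neq y}] \Big|
&\leq \Big| \lim_{n \to \infty} E_P[\mathbb{1}_{h_{f_n}(x) \neq y}] - \lim_{n \to \infty} E_P[\mathbb{1}_{\mathcal{X} \setminus B_\epsilon} \mathbb{1}_{h_{f_n}(x) \neq y}] \Big| \\
&\quad + \Big| \lim_{n \to \infty} E_P[\mathbb{1}_{\mathcal{X} \setminus B_\epsilon} \mathbb{1}_{h_{f_n}(x) \neq y}] - E_P[\mathbb{1}_{\mathcal{X} \setminus B_\epsilon} \mathbb{1}_{h_\eta(x) \neq y}] \Big| \\
&\quad + \Big| E_P[\mathbb{1}_{\mathcal{X} \setminus B_\epsilon} \mathbb{1}_{h_\eta(x) \neq y}] - E_P[\mathbb{1}_{h_\eta(x) \neq y}] \Big| \\
&\leq 3\epsilon.
\end{aligned}$$

(Strictly speaking, $\lim_{n \to \infty} E_P[\mathbb{1}_{h_{f_n}(x) \neq y}]$ is first limsup and then liminf to show that the limit exists.) Since $\epsilon$ is arbitrary, the proof is complete if $\mu(Z_0) = 0$.

If $\mu(Z_0) > 0$, we rerun the proof with $\mathcal{X}$ replaced by $Z_0$. As above, $f_n|_{Z_0}$ converges uniformly to $\eta(x)$ off a set of measure $\epsilon$. The argument above, without the set $Z_\delta$, gives

$$\int_{Z_0} \mathbb{1}_{h_{f_n}(x) \neq y}\, \mu(x)\, dx \to \int_{Z_0} \mathbb{1}_{h_\eta(x) \neq y}\, \mu(x)\, dx.$$

We then proceed with the proof above on $\mathcal{X} \setminus Z_0$. □

[2] Let $A_k$ be sets with $A_{k+1} \subset A_k$ and with $\mu(\cap_{k=1}^{\infty} A_k) = 0$. If $\mu(A_k) \not\to 0$, then there exists a subsequence, also called $A_k$, with $\mu(A_k) > K > 0$ for some $K$. We claim $\mu(\cap A_k) \geq K$, a contradiction. For the claim, let $Z = \cap A_k$. If $\mu(Z) \geq \mu(A_k)$ for all $k$, we are done. If not, since the $A_k$ are nested, we can replace $A_k$ by a set, also called $A_k$, of measure $K$ and such that the new $A_k$ are still nested. For the relabeled $Z = \cap A_k$, $Z \subset A_k$ for all $k$, and we may assume $\mu(Z) < K$. Thus there exists $Z' \subset A_1$ with $Z' \cap Z = \emptyset$ and $\mu(Z') > 0$. Since $\mu(A_i) = K$, we must have $A_i \cap Z' \neq \emptyset$ for all $i$. Thus $\cap A_i$ is strictly larger than $Z$, a contradiction. In summary, the claim must hold, so we get a contradiction to assuming $\mu(A_k) \not\to 0$.

Corollary 5. (Step 2 in general) For our algorithm,

$$\lim_{n \to \infty} R_{D,P}(f_{n,\lambda,m}) = 0 \;\Rightarrow\; \lim_{n \to \infty} R_P(f_{n,\lambda,m}) = R^*_P.$$

Proof. Choose $f_{1,\lambda,m}$ as in Theorem 2. Since $V_{tot,\lambda,m,f_{n,\lambda,m}}$ has pointwise length going to zero as $n \to \infty$, $\{f_{n,\lambda,m}(x)\}$ is a Cauchy sequence for all $x$. This implies that $f_{n,\lambda,m}$, and not just a subsequence, converges pointwise to $\eta$. □

B.4. Step 3

Lemma 6. (Step 3) If $k \to \infty$ and $k/m \to 0$ as $m \to \infty$, then for $f \in \mathrm{Maps}(\mathcal{X}, \Delta^{L-1})$,

$$|R_{D,\mathcal{T}_m}(f) - R_{D,P}(f)| \to 0 \text{ in probability as } m \to \infty,$$

for all distributions $P$ that generate $\mathcal{T}_m$.

Proof. Since $R_{D,P}(f)$ is a constant for fixed $f$ and $P$, convergence in probability will follow from weak convergence, i.e.,

$$E_{\mathcal{T}_m}\big[ |R_{D,\mathcal{T}_m}(f) - R_{D,P}(f)| \big] \to 0 \text{ as } m \to \infty.$$

We have

$$\begin{aligned}
|R_{D,\mathcal{T}_m}(f) - R_{D,P}(f)|
&= \Big| \int_{\mathcal{X}} \Big[ d^2\Big(f(x), \frac{1}{k} \sum_{i=1}^{k} z_i\Big) - d^2(f(x), \eta(x)) \Big] dx \Big| \\
&\leq \int_{\mathcal{X}} \Big| d^2\Big(f(x), \frac{1}{k} \sum_{i=1}^{k} z_i\Big) - d^2(f(x), \eta(x)) \Big|\, dx.
\end{aligned}$$

Set $a = f(x) - \frac{1}{k} \sum_{i=1}^{k} z_i$, $b = f(x) - \eta(x)$. Then

$$\begin{aligned}
\big| \|a\|_2^2 - \|b\|_2^2 \big|
&= \Big| \sum_{\ell=1}^{L} (a_\ell^2 - b_\ell^2) \Big|
\leq \sum_{\ell=1}^{L} |a_\ell^2 - b_\ell^2|
\leq 2 \sum_{\ell=1}^{L} |a_\ell - b_\ell| \max\{|a_\ell|, |b_\ell|\} \\
&\leq 2 \sum_{\ell=1}^{L} \Big| \Big(\frac{1}{k} \sum_{i=1}^{k} z_i\Big)^{\ell} - \eta^\ell(x) \Big|,
\end{aligned}$$

since $f^\ell(x), \frac{1}{k} \sum_{i=1}^{k} z_i^\ell, \eta^\ell(x) \in [0, 1]$. Therefore, it suffices to show that

$$\sum_{\ell=1}^{L} \int_{\mathcal{X}} \Big| \Big(\frac{1}{k} \sum_{i=1}^{k} z_i\Big)^{\ell} - \eta^\ell(x) \Big|\, dx \to 0 \text{ as } m \to \infty,$$

so the result follows if

$$\lim_{m \to \infty} E_{\mathcal{T}_m, x}\Big[ \Big| \Big(\frac{1}{k} \sum_{i=1}^{k} z_i\Big)^{\ell} - \eta^\ell(x) \Big| \Big] = 0 \text{ for all } \ell. \qquad (25)$$

By Jensen’s inequality $(E[f])^2 \leq E(f^2)$, (25) follows if

$$\lim_{m \to \infty} E_{\mathcal{T}_m, x}\Big[ \Big( \Big(\frac{1}{k} \sum_{i=1}^{k} z_i\Big)^{\ell} - \eta^\ell(x) \Big)^2 \Big] = 0 \text{ for all } \ell. \qquad (26)$$

Let $\eta_{k,m}(x) = \frac{1}{k} \sum_{i=1}^{k} z_i$. Then $\eta_{k,m}$ is actually an estimate of the class probability $\eta(x)$ by the $k$-Nearest Neighbor rule. Following the proof of Stone’s Theorem (Stone, 1977; Devroye et al., 1996), if $k \to \infty$ and $k/m \to 0$ as $m \to \infty$, (26) holds for all distributions $P$. □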
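
The $k$-Nearest Neighbor estimate $\eta_{k,m}(x) = \frac{1}{k} \sum_{i=1}^{k} z_i$ that appears in this proof is simple to compute. A minimal sketch follows, assuming Euclidean distance in $\mathcal{X}$ and training labels stored as one-hot rows (the vertices of $\Delta^{L-1}$). The function name and argument layout are illustrative.

```python
import numpy as np


def knn_class_probability(x, train_x, train_z, k):
    """eta_{k,m}(x): average of the one-hot labels of the k nearest training points."""
    d2 = ((train_x - x) ** 2).sum(axis=1)   # squared Euclidean distances to x
    nearest = np.argsort(d2)[:k]            # indices of the k nearest neighbors
    return train_z[nearest].mean(axis=0)
```

The output always lies in $\Delta^{L-1}$, since it is an average of simplex vertices. The conditions $k \to \infty$ and $k/m \to 0$ of Lemma 6 correspond to averaging over more neighbors while those neighbors still concentrate around $x$.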
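C. Notation

$h_f(x) = \arg\max\{f^\ell(x),\ \ell \in \mathcal{Y}\}$ : plug-in classifier of estimator $f : \mathcal{X} \to \Delta^{L-1}$

$\mathbb{1}_{h_f(x) \neq y} = \begin{cases} 1, & h_f(x) \neq y, \\ 0, & h_f(x) = y. \end{cases}$

$R_P(f) = E_P[\mathbb{1}_{h_f(x) \neq y}]$ : generalization risk for the estimator $f$

$\eta(x) = (\eta^1(x), \dots, \eta^L(x))$ : class probability function: $\eta^\ell(x) = P(y = \ell \mid x)$

$R^*_P = R_P(\eta)$ : Bayes risk

($D$-risk for our $\mathcal{P}_D$) $R_{D,P}(f) = \int_{\mathcal{X}} d^2(f(x), \eta(x))\, dx$

(empirical $D$-risk) $R_{D,\mathcal{T}_m}(f) = R_{D,\mathcal{T}_m,k}(f) = \int_{\mathcal{X}} d^2\big(f(x), \frac{1}{k} \sum_{i=1}^{k} z_i\big)\, dx$, where $z_i$ is the vector of the last $L$ components of $(x_i, z_i)$, with $x_i$ the $i$th nearest neighbor of $x$ in $\mathcal{T}_m$

(volume penalty term) $\mathcal{P}_G(f) = \int_{\mathrm{gr}(f)} d\mathrm{vol}$

$R_{D,P,\lambda}(f) = R_{D,P}(f) + \lambda \mathcal{P}_G(f)$ : regularized $D$-risk for estimator $f$

$R_{D,\mathcal{T}_m,\lambda}(f) = R_{D,\mathcal{T}_m}(f) + \lambda \mathcal{P}_G(f)$ : regularized empirical $D$-risk for estimator $f$

$f_{D,P,\lambda}$ : function attaining the global minimum for $R_{D,P,\lambda}$

$R^*_{D,P,\lambda} = R_{D,P,\lambda}(f_{D,P,\lambda})$ : minimum value for $R_{D,P,\lambda}$

$f_{D,\mathcal{T}_m,\lambda} = f_{D,\mathcal{T}_m,k,\lambda}$ : function attaining the global minimum for $R_{D,\mathcal{T}_m,\lambda}(f)$

Note that we assume $f_{D,P,\lambda}$ and $f_{D,\mathcal{T}_m,\lambda}$ exist.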



